This paper proposes a two-stream flow-guided convolutional attention network for action recognition in videos. The central idea is that optical flow, when properly compensated for camera motion, can be used to guide attention to the human foreground. We thus develop cross-link layers from the temporal network (trained on flows) to the spatial network (trained on RGB frames). These cross-link layers guide the spatial stream to pay more attention to human foreground areas and to be less affected by background clutter. We obtain promising performance with our approach on the UCF101, HMDB51 and Hollywood2 datasets.
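The core idea of flow-guided attention can be illustrated with a minimal sketch: flow magnitude (high where the actor moves, low in static background) is normalized into a spatial attention map and used to reweight the RGB-stream features. This is only an illustrative NumPy approximation, not the paper's trained cross-link layers; the function name and the simple min-max normalization are assumptions for the example.

```python
import numpy as np

def flow_guided_attention(rgb_feat, flow_mag):
    """Reweight spatial features by a flow-derived attention map.

    rgb_feat: (C, H, W) feature map from the spatial (RGB) stream.
    flow_mag: (H, W) optical-flow magnitude (camera motion already removed).
    Returns the attended feature map, same shape as rgb_feat.
    """
    # Normalize flow magnitude to [0, 1] as a crude attention map
    # (hypothetical choice; the paper learns this via cross-link layers).
    att = flow_mag - flow_mag.min()
    att = att / (att.max() + 1e-8)
    # Broadcast the (H, W) map over all C channels.
    return rgb_feat * att[None, :, :]

# Toy usage: a moving region in the center gets emphasized,
# static background is suppressed.
feat = np.ones((3, 4, 4))
flow = np.zeros((4, 4))
flow[1:3, 1:3] = 5.0  # motion in the center
out = flow_guided_attention(feat, flow)
```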